Summary

This report is meant to answer the following question. Can I predict how long a ride will from the time and place of departure, as well as some minimal information about the rider?

I'm examining three months of citibike rider data, from July through September of 2019, totaling about 7 million rides. The only significant peculiarity in the data, is a spike in the rider birth year for the year 1969. There are about 7 times more riders "born" in 1969 than the surrounding year. I would assume that 1969 is the default birth year when users create accounts.

As seen in this figure, 50% of all rides are 10 minutes or less and over 90% of all rides are shorter than 30 minutes. While there are less riders on Sundays as compared to the rest of the week, this does not explain daily ridership, as seen here. There is some other external factor, most probably weather, which has a stronger influence on the probability of a ride.

When trying to predict the length of a ride, none of the non-engineered variables are significantly correlated with trip duration, as seen here.

Readme

The code below was run on the following:

  • python 3.7.4
  • pandas 0.25.1
  • pandas-profiling 2.3.0
  • altair 3.3.0.dev0
  • numpy 1.17.2
  • jupyter 1.0.0
  • geopy 1.20.0

Imports and Functions

In [285]:
import pandas as pd
import altair as alt
import numpy as np
import pandas_profiling
from IPython.display import display, IFrame
import os
from geopy import distance
from typing import Tuple
In [2]:
# Run this cell only if you are running this in a jupyter notebook
alt.renderers.enable('notebook')
Out[2]:
RendererRegistry.enable('notebook')
In [294]:
# Run this cell only if you want to export your notebook as an html
alt.renderers.enable('default')
Out[294]:
RendererRegistry.enable('default')
In [171]:
def alt_hist(df: pd.DataFrame, col_of_int: str) -> pd.DataFrame:
    """
    Create a dataframe of the counts of each value
    """
    df_tmp = df[[col_of_int]].copy()
    df_tmp["col_count"] = 1
    df_tmp = df_tmp.groupby(col_of_int, as_index=False, sort=False).count()
    df_tmp = df_tmp.sort_values(col_of_int).reset_index(drop=True)
    return df_tmp
In [61]:
def standard_props(chart: alt.Chart) -> alt.Chart:
    """ 
    Default values for Altair charts
    """
    chart = (
        chart.properties(height=300, width=700)
        .configure_axis(labelFontSize=15, titleFontSize=20)
        .configure_legend(
            titleFontSize=18, labelFontSize=12, titleLimit=250, labelLimit=200
        )
        .configure_header(titleFontSize=20, labelFontSize=15)
    )
    return chart
In [214]:
def train_test_split(
    df: pd.DataFrame, train_smp: int, test_smp: int, rand_state: int = 44
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Split the data into a training and testing dataframe.
    """
    train_smp = min(train_smp, int(df.shape[0] * 0.8))
    test_smp = min(test_smp, int(df.shape[0] * 0.2))
    total_smp = train_smp + test_smp
    df_smp = df.sample(total_smp, random_state=rand_state)
    df_smp_train = df_smp.iloc[:train_smp, :]
    df_smp_test = df_smp.iloc[train_smp:, :]
    return (df_smp_train, df_smp_test)

Download and Concatenate the Data

In [3]:
!mkdir data
!mkdir data/raw
!wget -P data/raw https://s3.amazonaws.com/tripdata/201909-citibike-tripdata.csv.zip
!wget -P data/raw https://s3.amazonaws.com/tripdata/201908-citibike-tripdata.csv.zip
!wget -P data/raw https://s3.amazonaws.com/tripdata/201907-citibike-tripdata.csv.zip
--2019-10-25 12:03:24--  https://s3.amazonaws.com/tripdata/201909-citibike-tripdata.csv.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.134.29
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.134.29|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 104185546 (99M) [application/zip]
Saving to: ‘data/raw/201909-citibike-tripdata.csv.zip’

201909-citibike-tri 100%[===================>]  99.36M  90.0MB/s    in 1.1s    

2019-10-25 12:03:26 (90.0 MB/s) - ‘data/raw/201909-citibike-tripdata.csv.zip’ saved [104185546/104185546]

--2019-10-25 12:03:26--  https://s3.amazonaws.com/tripdata/201908-citibike-tripdata.csv.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.36.102
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.36.102|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 100021943 (95M) [application/zip]
Saving to: ‘data/raw/201908-citibike-tripdata.csv.zip’

201908-citibike-tri 100%[===================>]  95.39M  59.8MB/s    in 1.6s    

2019-10-25 12:03:28 (59.8 MB/s) - ‘data/raw/201908-citibike-tripdata.csv.zip’ saved [100021943/100021943]

--2019-10-25 12:03:28--  https://s3.amazonaws.com/tripdata/201907-citibike-tripdata.csv.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.36.102
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.36.102|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 93359147 (89M) [application/zip]
Saving to: ‘data/raw/201907-citibike-tripdata.csv.zip’

201907-citibike-tri 100%[===================>]  89.03M  85.1MB/s    in 1.0s    

2019-10-25 12:03:29 (85.1 MB/s) - ‘data/raw/201907-citibike-tripdata.csv.zip’ saved [93359147/93359147]

--2019-10-25 12:03:29--  https://s3.amazonaws.com/tripdata/201906-citibike-tripdata.csv.zip
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.36.102
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.36.102|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 91078031 (87M) [application/zip]
Saving to: ‘data/raw/201906-citibike-tripdata.csv.zip’

201906-citibike-tri 100%[===================>]  86.86M  84.0MB/s    in 1.0s    

2019-10-25 12:03:30 (84.0 MB/s) - ‘data/raw/201906-citibike-tripdata.csv.zip’ saved [91078031/91078031]

In [5]:
!unzip -q data/raw/201909-citibike-tripdata.csv.zip -d data/raw/
!unzip -q data/raw/201908-citibike-tripdata.csv.zip -d data/raw/
!unzip -q data/raw/201907-citibike-tripdata.csv.zip -d data/raw/
In [7]:
cb_csv = [x for x in os.listdir("data/raw") if x.endswith(".csv")]
df_cb_list = []
for csv in cb_csv:
    df_cb_list.append(pd.read_csv(os.path.join("data/raw",csv)))
df_cb = pd.concat(df_cb_list, ignore_index=True, sort=False)
del df_cb_list
df_cb = df_cb.rename(columns={x:x.replace(" ", "_") for x in df_cb.columns})
In [10]:
df_cb.to_feather("data/citibike_data_201907-201909.feather")

Load the Data

In [9]:
df_cb = pd.read_feather("data/citibike_data_201907-201909.feather")
df_cb.shape
Out[9]:
(6970188, 15)
In [11]:
df_cb.starttime = pd.to_datetime(df_cb.starttime, format="%Y-%m-%d %H:%M:%S.%f")
df_cb.stoptime = pd.to_datetime(df_cb.stoptime, format="%Y-%m-%d %H:%M:%S.%f")

Profile the Data

In [13]:
# profile = df_cb.profile_report(title="Pandas Profiling Report")
# profile.to_file(output_file="data/citibike_data_report.html")

EDA

EDA General Guidelines

  • You need to look at the data. Just because pandas says there are no missing values, doesn't mean there aren't thousands of fields that say "none" or "NULL".
  • Look for outliers and values that make no sense and remove, correct, or impute them as necessary. But make sure to explain why you are altering the data
  • Initially, only do the basic EDA/sanity checks. If you have more time, do more, but do not spend your entire alloted time doing EDA

Trip Duration, Start Time, End Time

Sanity Checks

Check for missing values

In [80]:
df_cb[["starttime", "stoptime", "tripduration"]].isna().sum()
Out[80]:
starttime       0
stoptime        0
tripduration    0
dtype: int64

No one arrives before they leave

In [77]:
(df_cb.starttime > df_cb.stoptime).sum()
Out[77]:
0

The trip duration matches the difference between the starttime and the stoptime

In [14]:
df_cb["calculatedduration"] = (df_cb.stoptime - df_cb.starttime).dt.total_seconds().astype(int)
df_cb.loc[df_cb.calculatedduration != df_cb.tripduration, ["calculatedduration", "tripduration"]]
df_cb["duration_diff"] = df_cb.calculatedduration -  df_cb.tripduration
df_cb.duration_diff.value_counts()
Out[14]:
0    6970188
Name: duration_diff, dtype: int64

Trip Duration

In [15]:
df_cb["tripduration_min"] = (df_cb.tripduration/60).astype(int)
In [176]:
col_of_int = "tripduration_min"
col_title = "Trip Duration (Minutes)"
df_trip_dur = alt_hist(df_cb, col_of_int=col_of_int)
df_trip_dur["cum_prcnt"] = df_trip_dur.col_count.cumsum()/df_trip_dur.col_count.sum()
In [295]:
col_of_int = "tripduration_min"
col_title = "Trip Duration (Minutes)"

p = alt.Chart(df_trip_dur).mark_bar().encode(
    x = alt.X(col_of_int, title=col_title, axis=alt.Axis(format="c")),
    y = alt.Y("col_count", title="Count", axis=alt.Axis(format="s"), scale=alt.Scale(type="log", base=10)),
    tooltip=[alt.Tooltip("col_count", title="Count", format="~s"),
            alt.Tooltip(col_of_int, title=col_title)]
)
standard_props(p).interactive()
Out[295]:
In [291]:
IFrame("tripdur_min.html", width="100%", height=400)
Out[291]:

In [185]:
col_of_int = "tripduration_min"
col_title = "Trip Duration (Minutes)"

p1 = alt.Chart(df_trip_dur.loc[df_trip_dur.tripduration_min <= 80]).mark_bar().encode(
    x=alt.X(
        col_of_int, title=col_title, axis=alt.Axis(format="c")
    ),
    y=alt.Y("col_count", title="Count", axis=alt.Axis(format="s")),
    tooltip=[
        alt.Tooltip("col_count", title="Count", format="~s"),
        alt.Tooltip(col_of_int, title=col_title),
    ],
)

p2 = alt.Chart(df_trip_dur.loc[df_trip_dur.tripduration_min <= 80]).mark_line(color="black").encode(
    x=alt.X(
        col_of_int, title=col_title, axis=alt.Axis(format="c")
    ),
    y=alt.Y("cum_prcnt", title="Cumulative Percent", axis=alt.Axis(format="%", grid=True)),
)


standard_props(p1+p2).resolve_scale(y='independent')
Out[185]:

Start Time

In [23]:
df_cb["start_month"] = df_cb.starttime.dt.month
df_cb["start_day"] = df_cb.starttime.dt.day
df_cb["start_date"] = pd.to_datetime(df_cb.starttime.dt.date, format="%Y-%m-%d")
df_cb["start_dow"] = df_cb.starttime.dt.day_name()
df_cb["start_hour"] = df_cb.starttime.dt.hour

In [207]:
col_of_int = "start_date:T"
col_title = "Start Date"

df_start_date = alt_hist(df_cb, col_of_int="start_date")
df_start_date = df_start_date.merge(
    df_cb[["start_date", "start_dow"]].drop_duplicates(), on="start_date", how="left"
)
df_start_date["weekend"] = False
df_start_date.loc[df_start_date.start_dow.isin(["Saturday", "Sunday"]), "weekend"] = True


p = (
    alt.Chart(df_start_date)
    .mark_bar()
    .encode(
        x=alt.X(col_of_int, title=col_title),
        y=alt.Y("col_count", title="Count", axis=alt.Axis(format="s")),
        color=alt.Color("weekend", title="Weekend"),
        tooltip=[
            alt.Tooltip("col_count", title="Count", format="~s"),
            alt.Tooltip(col_of_int, title=col_title),
            alt.Tooltip("start_dow", title="Day of Week"),
        ],
    )
)
standard_props(p).interactive()
Out[207]:

In [208]:
col_of_int = "start_dow"
col_title = "Day of Week"
df_start_dow = df_start_date[["col_count", "start_dow"]].groupby("start_dow", as_index=False, sort=False).mean()

alt.Chart(df_start_dow).mark_bar().encode(
    x=alt.X(
        col_of_int,
        title=col_title,
        sort=[
            "Sunday",
            "Monday",
            "Tuesday",
            "Wednesday",
            "Thursday",
            "Friday",
            "Saturday",
        ],
    ),
    y=alt.Y(
        "col_count",
        title="Average Count",
        axis=alt.Axis(format="s"),
        scale=alt.Scale(zero=False),
    ),
    tooltip=[
        alt.Tooltip("col_count", title="Average Count", format="~s"),
        alt.Tooltip(col_of_int, title=col_title),
    ],
).properties(height=300, width=800).interactive()
Out[208]:

In [73]:
col_of_int = "start_hour"
col_title = "Hour of Day"
df_start_hour = alt_hist(df_cb, col_of_int=col_of_int)

alt.Chart(df_start_hour).mark_bar().encode(
    x=alt.X(
        col_of_int,
        title=col_title,
    ),
    y=alt.Y("col_count", title="Count", axis=alt.Axis(format="s"), scale=alt.Scale(zero=False)),
    tooltip=[
        alt.Tooltip("col_count", title="Count", format="~s"),
        alt.Tooltip(col_of_int, title=col_title),
    ],
).properties(height=300, width=800)
Out[73]:

Start Stations and End Stations

Sanity Checks

Missing Values

In [86]:
df_cb.columns
Out[86]:
Index(['tripduration', 'starttime', 'stoptime', 'start_station_id',
       'start_station_name', 'start_station_latitude',
       'start_station_longitude', 'end_station_id', 'end_station_name',
       'end_station_latitude', 'end_station_longitude', 'bikeid', 'usertype',
       'birth_year', 'gender', 'start_month', 'start_day', 'tripduration_min',
       'start_date', 'start_hour', 'start_dow', 'calculatedduration',
       'duration_diff'],
      dtype='object')
In [87]:
df_cb[
    [
        "start_station_id",
        "start_station_name",
        "start_station_latitude",
        "start_station_longitude",
        "end_station_id",
        "end_station_name",
        "end_station_latitude",
        "end_station_longitude",
    ]
].isna().sum()
Out[87]:
start_station_id           143
start_station_name         143
start_station_latitude       0
start_station_longitude      0
end_station_id             143
end_station_name           143
end_station_latitude         0
end_station_longitude        0
dtype: int64

Check for any empty strings

In [104]:
df_cb.loc[(df_cb.start_station_name.notna()) & (df_cb.start_station_name.str.match("^\s*$"))]
Out[104]:
tripduration starttime stoptime start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude ... birth_year gender start_month start_day tripduration_min start_date start_hour start_dow calculatedduration duration_diff

0 rows × 23 columns

In [105]:
df_cb.loc[(df_cb.end_station_name.notna()) & (df_cb.end_station_name.str.match("^\s*$"))]
Out[105]:
tripduration starttime stoptime start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude ... birth_year gender start_month start_day tripduration_min start_date start_hour start_dow calculatedduration duration_diff

0 rows × 23 columns

Check for any null/none/test or equivalent strings

In [108]:
df_cb.loc[df_cb.start_station_name.str.len()<10].start_station_name.unique()
Out[108]:
array(['Cherry St'], dtype=object)
In [109]:
df_cb.loc[df_cb.end_station_name.str.len()<10].end_station_name.unique()
Out[109]:
array(['Cherry St', 'Grand St', 'Hilltop', 'City Hall', 'Sip Ave',
       'York St', 'Union St'], dtype=object)
In [110]:
set(df_cb.start_station_name.tolist()  + df_cb.end_station_name.tolist())
Out[110]:
{'1 Ave & E 110 St',
 '1 Ave & E 16 St',
 '1 Ave & E 18 St',
 '1 Ave & E 30 St',
 '1 Ave & E 44 St',
 '1 Ave & E 62 St',
 '1 Ave & E 68 St',
 '1 Ave & E 78 St',
 '1 Ave & E 94 St',
 '1 Pl & Clinton St',
 '10 Ave & W 28 St',
 '10 Hudson Yards',
 '10 St & 2 Ave',
 '10 St & 5 Ave',
 '10 St & 7 Ave',
 '11 Ave & W 27 St',
 '11 Ave & W 41 St',
 '11 Ave & W 59 St',
 '11 St & 35 Ave',
 '11 St & 43 Ave',
 '12 Ave & W 40 St',
 '12 St & 4 Ave',
 '14 St & 5 Ave',
 '14 St & 7 Ave',
 '19 St & 24 Ave',
 '2 Ave & 36 St - Citi Bike HQ at Industry City',
 '2 Ave & E 104 St',
 '2 Ave & E 122 St',
 '2 Ave & E 31 St',
 '2 Ave & E 72 St',
 '2 Ave & E 96 St',
 '2 Ave & E 99 St',
 '21 St & 31 Dr',
 '21 St & 36 Ave',
 '21 St & 38 Ave',
 '21 St & 43 Ave',
 '21 St & Hoyt Ave S',
 '21 St & Queens Plaza North',
 '23 Ave & 27 St',
 '24 Ave & 26 St',
 '24 Ave & 29 St',
 '24 St & 41 Ave',
 '27 Ave & 3 St',
 '27 Ave & 4 St',
 '27 Ave & 9 St',
 '27 St & Hunter St',
 '28 Ave & 35 St',
 '28 Ave & 44 St',
 '28 St & 36 Ave',
 '28 St & 38 Ave',
 '28 St & 41 Ave',
 '3 Ave & 14 St',
 '3 Ave & Carroll St',
 '3 Ave & E 100 St',
 '3 Ave & E 112 St',
 '3 Ave & E 62 St',
 '3 Ave & E 71 St',
 '3 Ave & E 72 St',
 '3 Ave & E 95 St',
 '3 Ave & Schermerhorn St',
 '3 St & 3 Ave',
 '3 St & 7 Ave',
 '3 St & Hoyt St',
 '3 St & Prospect Park West',
 '30 Ave & 21 St',
 '31 Ave & 14 St',
 '31 Ave & 30 St',
 '31 Ave & 34 St',
 '31 Ave & Crescent St',
 '31 Ave & Steinway St',
 '31 St & 23 Ave',
 '31 St & Astoria Blvd',
 '31 St & Broadway',
 '31 St & Hoyt Ave N',
 '31 St & Newtown Ave',
 '31 St & Northern Blvd',
 '31 St & Thomson Ave',
 '34 Ave & 13 St',
 '34 Ave & 21 St',
 '34 Ave & 38 St',
 '34 St & 35 Ave',
 '35 Ave & 10 St',
 '35 Ave & 37 St',
 '35 St & 34 Ave',
 '35 St & Broadway',
 '36 Ave & 10 St',
 '36 Ave & 31 St',
 '37 Ave & 35 St',
 '37 St & 24 Ave',
 '39 St & 2 Ave - Citi Bike HQ at Industry City',
 '4 Ave & 9 St',
 '40 Ave & 9 St',
 '40 Ave & Crescent St',
 '43 St & Broadway',
 '44 Dr & 21 St',
 '44 Dr & Jackson Ave',
 '45 Rd & 11 St',
 '46 Ave & 5 St',
 '47 Ave & 31 St',
 '47 Ave & Skillman Ave',
 '48 Ave & 30 Pl',
 '48 Ave & 5 St',
 '49 Ave & 21 St',
 '5 Ave & 3 St',
 '5 Ave & E 103 St',
 '5 Ave & E 126 St',
 '5 Ave & E 29 St',
 '5 Ave & E 63 St',
 '5 Ave & E 73 St',
 '5 Ave & E 78 St',
 '5 Ave & E 88 St',
 '5 Ave & E 93 St',
 '5 St & 51 Ave',
 '5 St & 6 Ave',
 '5 St & Market St',
 '6 Ave & 12 St',
 '6 Ave & 9 St',
 '6 Ave & Broome St',
 '6 Ave & Canal St',
 '6 Ave & W 33 St',
 '6 Ave & W 34 St',
 '6 St & 7 Ave',
 '7 Ave & Central Park South',
 '7 Ave & Park Pl',
 '7 St & 3 Ave',
 '7 St & 5 Ave',
 '8 Ave & W 31 St',
 '8 Ave & W 33 St',
 '8 Ave & W 52 St',
 '9 Ave & W 18 St',
 '9 Ave & W 22 St',
 '9 Ave & W 28 St',
 '9 Ave & W 45 St',
 '9 St & 44 Rd',
 'Adam Clayton Powell Blvd & W 115 St',
 'Adam Clayton Powell Blvd & W 118 St',
 'Adam Clayton Powell Blvd & W 123 St',
 'Adam Clayton Powell Blvd & W 126 St',
 'Adelphi St & Myrtle Ave',
 'Albany Ave & Fulton St',
 'Allen St & Hester St',
 'Allen St & Rivington St',
 'Allen St & Stanton St',
 'Amsterdam Ave & W 119 St',
 'Amsterdam Ave & W 125 St',
 'Amsterdam Ave & W 66 St',
 'Amsterdam Ave & W 73 St',
 'Amsterdam Ave & W 79 St',
 'Amsterdam Ave & W 82 St',
 'Astoria Park S & Shore Blvd',
 'Atlantic Ave & Fort Greene Pl',
 'Atlantic Ave & Furman St',
 'Avenue C & E 18 St',
 'Avenue D & E 12 St',
 'Avenue D & E 3 St',
 'Avenue D & E 8 St',
 'Baltic St & 5 Ave',
 'Bank St & Hudson St',
 'Bank St & Washington St',
 'Banker St & Meserole Ave',
 'Barclay St & Church St',
 'Barrow St & Hudson St',
 'Bayard St & Baxter St',
 'Bayard St & Leonard St',
 'Beaver St & Fayette St',
 'Bedford Ave & Bergen St',
 'Bedford Ave & Montgomery St',
 'Bedford Ave & Nassau Ave',
 'Bedford Ave & S 9 St',
 'Bergen St & Flatbush Ave',
 'Bergen St & Smith St',
 'Bergen St & Vanderbilt Ave',
 'Berkeley Pl & 6 Ave',
 'Berkeley Pl & 7 Ave',
 'Berry St & N 8 St',
 'Bialystoker Pl & Delancey St',
 'Boerum St & Broadway',
 'Bond St & Bergen St',
 'Bond St & Fulton St',
 'Broad St & Bridge St',
 'Broadway & 12 St',
 'Broadway & Battery Pl',
 'Broadway & Berry St',
 'Broadway & E 14 St',
 'Broadway & E 22 St',
 'Broadway & Moylan Pl',
 'Broadway & Roebling St',
 'Broadway & W 122 St',
 'Broadway & W 25 St',
 'Broadway & W 29 St',
 'Broadway & W 32 St',
 'Broadway & W 36 St',
 'Broadway & W 37 St',
 'Broadway & W 38 St',
 'Broadway & W 41 St',
 'Broadway & W 49 St',
 'Broadway & W 51 St',
 'Broadway & W 53 St',
 'Broadway & W 56 St',
 'Broadway & W 60 St',
 'Broadway & Whipple St',
 'Brooklyn Bridge Park - Pier 2',
 'Brunswick St',
 'Bus Slip & State St',
 'Bushwick Ave & Dekalb Ave',
 'Bushwick Ave & Forrest St',
 'Bushwick Ave & McKibbin St',
 'Bushwick Ave & Powers St',
 'Bushwick Ave & Stagg St',
 'Butler St & Court St',
 'Cadman Plaza E & Johnson St',
 'Cadman Plaza E & Red Cross Pl',
 'Cadman Plaza West & Montague St',
 'Calyer St & Jewel St',
 'Cambridge Pl & Gates Ave',
 'Canal St & Rutgers St',
 'Carlton Ave & Dean St',
 'Carlton Ave & Flushing Ave',
 'Carlton Ave & Park Ave',
 'Carmine St & 6 Ave',
 'Carroll St & 5 Ave',
 'Carroll St & 6 Ave',
 'Carroll St & Bond St',
 'Carroll St & Columbia St',
 'Carroll St & Franklin Ave',
 'Carroll St & Smith St',
 'Carroll St & Washington Ave',
 'Cathedral Pkwy & Broadway',
 'Catherine St & Monroe St',
 'Cedar St & Myrtle Ave',
 'Center Blvd & 48 Ave',
 'Center Blvd & 51 Ave',
 'Central Ave & Flushing Ave',
 'Central Ave & Starr Street',
 'Central Park North & Adam Clayton Powell Blvd',
 'Central Park S & 6 Ave',
 'Central Park W & W 96 St',
 'Central Park West & W 100 St',
 'Central Park West & W 102 St',
 'Central Park West & W 68 St',
 'Central Park West & W 72 St',
 'Central Park West & W 76 St',
 'Central Park West & W 82 St',
 'Central Park West & W 85 St',
 'Centre St & Chambers St',
 'Centre St & Worth St',
 'Cherry St',
 'Christopher St & Greenwich St',
 'City Hall',
 'Clark St & Henry St',
 'Classon Ave & St Marks Ave',
 'Clermont Ave & Lafayette Ave',
 'Clermont Ave & Park Ave',
 'Cleveland Pl & Spring St',
 'Cliff St & Fulton St',
 'Cliff St & Fulton St (Old)',
 'Clinton Ave & Flushing Ave',
 'Clinton Ave & Myrtle Ave',
 'Clinton St & 4 Place',
 'Clinton St & Centre St',
 'Clinton St & Grand St',
 'Clinton St & Joralemon St',
 'Clinton St & Union St',
 'Coffey St & Conover St',
 'Columbia Heights & Cranberry St',
 'Columbia St & Degraw St',
 'Columbia St & Kane St',
 'Columbia St & Lorraine St',
 'Columbia St & Rivington St',
 'Columbia St & W 9 St',
 'Columbus Ave & W 103 St',
 'Columbus Ave & W 72 St',
 'Columbus Ave & W 95 St',
 'Columbus Drive',
 'Commerce St & Van Brunt St',
 'Concord St & Bridge St',
 'Congress St & Clinton St',
 'Cooper Square & Astor Pl',
 'Court St & Nelson St',
 'Court St & State St',
 'Crescent St & 30 Ave',
 'Crescent St & 34 Ave',
 'Crescent St & 35 Ave',
 'Crescent St & Broadway',
 'Crescent St & Ditmars Blvd',
 'Crown St & Bedford Ave',
 'Cumberland St & Lafayette Ave',
 'DeKalb Ave & Franklin Ave',
 'DeKalb Ave & Hudson Ave',
 'DeKalb Ave & S Portland Ave',
 'DeKalb Ave & Vanderbilt Ave',
 'Dean St & 4 Ave',
 'Dean St & Franklin Ave',
 'Dean St & Hoyt St',
 'Degraw St & Hoyt St',
 'Degraw St & Smith St',
 'Devoe St & Lorimer St',
 'Devoe St & Morgan Ave',
 'Ditmars Blvd & 19 St',
 'Division Av & Hooper St',
 'Division Ave & Marcy Ave',
 'Division St & Bowery',
 'Dock 72 Way & Market St',
 'Douglass St & 3 Ave',
 'Douglass St & 4 Ave',
 'Driggs Ave & Lorimer St',
 'Driggs Ave & N 9 St',
 'Driggs Ave & N Henry St',
 'Duane St & Greenwich St',
 'Duffield St & Willoughby St',
 'Dwight St & Van Dyke St',
 'E 10 St & 5 Ave',
 'E 10 St & Avenue A',
 'E 102 St & 1 Ave',
 'E 102 St & Park Ave',
 'E 103 St & Lexington Ave',
 'E 106 St & 1 Ave',
 'E 106 St & Lexington Ave',
 'E 106 St & Madison Ave',
 'E 109 St & 3 Ave',
 'E 11 St & 1 Ave',
 'E 11 St & 2 Ave',
 'E 11 St & Avenue B',
 'E 110 St & Madison Ave',
 'E 114 St & 1 Ave',
 'E 115 St & Lexington Ave',
 'E 115 St & Madison Ave',
 'E 116 St & 2 Ave',
 'E 118 St & 1 Ave',
 'E 118 St & 3 Ave',
 'E 118 St & Madison Ave',
 'E 118 St & Park Ave',
 'E 12 St & 3 Ave',
 'E 12 St & 4 Av',
 'E 123 St & Lexington Ave',
 'E 128 St & Madison Ave',
 'E 13 St & Avenue A',
 'E 15 St & 3 Ave',
 'E 16 St & 5 Ave',
 'E 16 St & Irving Pl',
 'E 17 St & Broadway',
 'E 19 St & 3 Ave',
 'E 2 St & 2 Ave',
 'E 2 St & Avenue A',
 'E 2 St & Avenue B',
 'E 2 St & Avenue C',
 'E 20 St & 2 Ave',
 'E 20 St & FDR Drive',
 'E 20 St & Park Ave',
 'E 23 St & 1 Ave',
 'E 24 St & Park Ave S',
 'E 25 St & 1 Ave',
 'E 25 St & 2 Ave',
 'E 27 St & 1 Ave',
 'E 30 St & Park Ave S',
 'E 31 St & 3 Ave',
 'E 32 St & Park Ave',
 'E 33 St & 1 Ave',
 'E 33 St & 5 Ave',
 'E 39 St & 2 Ave',
 'E 39 St & 3 Ave',
 'E 4 St & 2 Ave',
 'E 41 St & Madison Ave',
 'E 43 St & 2 Ave',
 'E 43 St & 5 Ave',
 'E 45 St & 3 Ave',
 'E 47 St & 1 Ave',
 'E 47 St & 2 Ave',
 'E 47 St & Park Ave',
 'E 48 St & 3 Ave',
 'E 48 St & 5 Ave',
 'E 5 St & Avenue C',
 'E 51 St & 1 Ave',
 'E 51 St & Lexington Ave',
 'E 52 St & 2 Ave',
 'E 53 St & 3 Ave',
 'E 53 St & Madison Ave',
 'E 55 St & 2 Ave',
 'E 55 St & 3 Ave',
 'E 55 St & Lexington Ave',
 'E 56 St & 3 Ave',
 'E 58 St &  1 Ave (NW Corner)',
 'E 58 St & 1 Ave (NE Corner)',
 'E 58 St & 3 Ave',
 'E 58 St & Madison Ave',
 'E 59 St & Madison Ave',
 'E 6 St & Avenue B',
 'E 6 St & Avenue D',
 'E 60 St & York Ave',
 'E 65 St & 2 Ave',
 'E 66 St & Madison Ave',
 'E 67 St & Park Ave',
 'E 68 St & 3 Ave',
 'E 68 St & Madison Ave',
 'E 7 St & Avenue A',
 'E 72 St & Park Ave',
 'E 72 St & York Ave',
 'E 74 St & 1 Ave',
 'E 75 St & 3 Ave',
 'E 76 St & Park Ave',
 'E 77 St & 3 Ave',
 'E 78 St & 2 Ave',
 'E 80 St & Park Ave',
 'E 81 St & 2 Ave',
 'E 81 St & 3 Ave',
 'E 81 St & Park Ave',
 'E 81 St & York Ave',
 'E 82 St & East End Ave',
 'E 84 St & 1 Ave',
 'E 84 St & 3 Ave',
 'E 84 St & Park Ave',
 'E 85 St & York Ave',
 'E 88 St & 1 Ave',
 'E 88 St & Park Ave',
 'E 89 St & 3 Ave',
 'E 89 St & York Ave',
 'E 9 St & Avenue C',
 'E 91 St & 2 Ave',
 'E 91 St & Park Ave',
 'E 93 St & 2 Ave',
 'E 95 St & 3 Ave',
 'E 97 St & 3 Ave',
 'E 97 St & Madison Ave',
 'E 98 St & Park Ave',
 'East End Ave & E 86 St',
 'Eastern Pkwy & Franklin Ave',
 'Eastern Pkwy & Washington Ave',
 'Eckford St & Engert Ave',
 'Emerson Pl & Myrtle Ave',
 'Evergreen Ave & Noll St',
 'FDR Drive & E 35 St',
 'Flushing Ave & Vanderbilt Ave',
 'Forsyth St & Broome St',
 'Forsyth St & Canal St',
 'Franklin Ave & Empire Blvd',
 'Franklin Ave & Myrtle Ave',
 'Franklin Ave & St Marks Ave',
 'Franklin St & Dupont St',
 'Franklin St & W Broadway',
 'Frederick Douglass Blvd & W 112 St',
 'Frederick Douglass Blvd & W 115 St',
 'Frederick Douglass Blvd & W 117 St',
 'Frederick Douglass Blvd & W 129 St',
 'Front St & Gold St',
 'Front St & Jay St',
 'Front St & Maiden Ln',
 'Front St & Washington St',
 'Frost St & Meeker Ave',
 'Fulton St & Adams St',
 'Fulton St & Broadway',
 'Fulton St & Clermont Ave',
 'Fulton St & Grand Ave',
 'Fulton St & Irving Pl',
 'Fulton St & Rockwell Pl',
 'Fulton St & Utica Ave',
 'Fulton St & Waverly Ave',
 'Fulton St & William St',
 'Gansevoort St & Hudson St',
 'Garfield Pl & 8 Ave',
 'George St & Wilson Ave',
 'Graham Ave & Conselyea St',
 'Graham Ave & Grand St',
 'Graham Ave & Herbert St',
 'Graham Ave & Withers St',
 'Grand Army Plaza & Central Park S',
 'Grand Army Plaza & Plaza St West',
 'Grand Ave & Bergen St',
 'Grand St',
 'Grand St & Elizabeth St',
 'Grand St & Havemeyer St',
 'Great Jones St',
 'Green St & McGuinness Blvd',
 'Greene Av & Myrtle Av',
 'Greene Ave & Nostrand Ave',
 'Greene Ave & Throop Ave',
 'Greenpoint Ave & Manhattan Ave',
 'Greenwich Ave & 8 Ave',
 'Greenwich Ave & Charles St',
 'Greenwich St & W Houston St',
 'Halsey St & Tompkins Ave',
 'Hamilton Park',
 'Hancock St & Bedford Ave',
 'Handlebar @ Gowanus',
 'Hanson Pl & Ashland Pl',
 'Harborside',
 'Harrison Pl & Porter Ave',
 'Harrison St & Hudson St',
 'Hart St & Wyckoff Ave',
 'Heights Elevator',
 'Henry St & Atlantic Ave',
 'Henry St & Bay St',
 'Henry St & Degraw St',
 'Henry St & Grand St',
 'Henry St & Middagh St',
 'Henry St & W 9 St',
 'Hicks St & Montague St',
 'Hilltop',
 'Hoboken Ave at Monmouth St',
 'Hope St & Union Ave',
 'Howard St & Centre St',
 'Hudson St & Reade St',
 'Humboldt St & Varet St',
 'Huron St & Franklin St',
 'India St & Manhattan Ave',
 'India St & West St',
 'Irving Ave & DeKalb Ave',
 'Irving Ave & Jefferson St',
 'JC Medical Center',
 'JCBS Depot',
 'Jackson Ave & 46 Rd',
 'Jackson St & Leonard St',
 'Jay St & Tech Pl',
 'Jay St & York St',
 'Jersey & 6th St',
 'John St & William St',
 'Johnson St & Gold St',
 'Journal Square',
 'Kane St & Clinton St',
 'Kent Ave & N 7 St',
 'Kent Ave & S 11 St',
 'Kent St & McGuinness Blvd',
 'Kingston Ave & Herkimer St',
 'Knickerbocker Ave & George St',
 'Knickerbocker Ave & Thames St',
 'Kosciuszko St & Nostrand Ave',
 'Kosciuszko St & Tompkins Ave',
 'LaGuardia Pl & W 3 St',
 'Lafayette Ave & Classon Ave',
 'Lafayette Ave & Fort Greene Pl',
 'Lafayette Ave & St James Pl',
 'Lafayette St & E 8 St',
 'Lafayette St & Jersey St',
 'Lafayette St & Jersey St S',
 'Laight St & Hudson St',
 'Lawrence St & Willoughby St',
 'Lefferts Pl & Franklin Ave',
 'Lenox Ave & W 111 St',
 'Lenox Ave & W 115 St',
 'Lenox Ave & W 117 St',
 'Lenox Ave & W 126 St',
 'Lenox Ave & W 130 St',
 'Leonard St & Boerum St',
 'Leonard St & Church St',
 'Leonard St & Maujer St',
 'Leonard St & Nassau Ave',
 'Lewis Ave & Decatur St',
 'Lewis Ave & Kosciuszko St',
 'Lewis Ave & Madison St',
 'Lexington Ave & Classon Ave',
 'Lexington Ave & E 111 St',
 'Lexington Ave & E 120 St',
 'Lexington Ave & E 127 St',
 'Lexington Ave & E 26 St',
 'Lexington Ave & E 29 St',
 'Lexington Ave & E 36 St',
 'Lexington Ave & E 63 St',
 'Liberty Light Rail',
 'Liberty St & Broadway',
 'Lincoln Pl & Classon Ave',
 'Lispenard St & Broadway',
 'Little West St & 1 Pl',
 'Lorimer St & Broadway',
 'MacDougal St & Prince St',
 'MacDougal St & Washington Sq',
 'Macon St & Nostrand Ave',
 'Madison Ave & E 120 St',
 'Madison Ave & E 82 St',
 'Madison Ave & E 99 St',
 'Madison St & Clinton St',
 'Madison St & Montgomery St',
 'Maiden Ln & Pearl St',
 'Marcus Garvey Blvd & Macon St',
 'Marcy Ave & Lafayette Ave',
 'Marin Light Rail',
 'Market St & Cherry St',
 'McGuinness Blvd & Eagle St',
 'McKibbin St & Bogart St',
 'McKibbin St & Manhattan Ave',
 'Mercer St & Bleecker St',
 'Mercer St & Spring St',
 'Meserole Ave & Manhattan Ave',
 'Metropolitan Ave & Bedford Ave',
 'Metropolitan Ave & Meeker Ave',
 'Metropolitan Ave & Vandervoort Ave',
 'Milton St & Franklin St',
 'Monmouth and 6th',
 'Monroe St & Bedford Ave',
 'Monroe St & Classon Ave',
 'Monroe St & Tompkins Ave',
 'Montgomery St',
 'Montgomery St & Franklin Ave',
 'Montrose Ave & Bushwick Ave',
 'Morgan Ave & Maspeth Ave',
 'Morningside Ave & W 123 St',
 'Morningside Dr & Amsterdam Ave',
 'Morris Canal',
 'Mott St & Prince St',
 'Mt Morris Park W & W 120 St',
 'Murray St & Greenwich St',
 'Murray St & West St',
 'Myrtle Ave & Lewis Ave',
 'Myrtle Ave & Marcy Ave',
 'Myrtle Ave & St Edwards St',
 'N 11 St & Kent Ave',
 'N 12 St & Bedford Ave',
 'N 15 St & Wythe Ave',
 'N 6 St & Bedford Ave',
 'NYCBS DEPOT - DELANCEY',
 'NYCBS Depot - GOW',
 'NYCBS Depot BAL - DYR',
 'Nassau Ave & Newell St',
 'Nassau Ave & Russell St',
 'Nassau St & Navy St',
 'Newport PATH',
 'Newport Pkwy',
 'Newton Rd & 44 St',
 'Newtown Ave & 23 St',
 None,
 'Norfolk St & Broome St',
 'North Moore St & Greenwich St',
 'Nostrand Ave & Myrtle Ave',
 'Old Fulton St',
 'Old Slip & Front St',
 'Pacific St & Classon Ave',
 'Pacific St & Nevins St',
 'Park Ave & E 124 St',
 'Park Ave & Marcus Garvey Blvd',
 'Park Ave & St Edwards St',
 'Park Pl & Church St',
 'Park Pl & Franklin Ave',
 'Park Pl & Vanderbilt Ave',
 'Paulus Hook',
 'Pearl St & Anchorage Pl',
 'Pearl St & Hanover Square',
 'Peck Slip & Front St',
 'Perry St & Bleecker St',
 'Pershing Field',
 'Pershing Square North',
 'Pershing Square South',
 'Picnic Point',
 'Pier 40 - Hudson River Park',
 'Pierrepont St & Monroe Pl',
 'Pike St & E Broadway',
 'Pike St & Monroe St',
 'Pioneer St & Richards St',
 'Pioneer St & Van Brunt St',
 'Pitt St & Stanton St',
 'Plaza St West & Flatbush Ave',
 'Pleasant Ave & E 116 St',
 'Pleasant Ave & E 120 St',
 'Powers St & Olive St',
 'President St & Henry St',
 'Prospect Park West & 8 St',
 'Prospect Pl & 6 Ave',
 'Prospect Pl & Underhill Ave',
 'Pulaski St & Marcus Garvey Blvd',
 'Putnam Ave & Nostrand Ave',
 'Putnam Ave & Throop Ave',
 'Queens Plaza North & Crescent St',
 'Railroad Ave & Kay Ave',
 'Reade St & Broadway',
 'Richards St & Delavan St',
 'Richardson St & N Henry St',
 'Riverside Blvd & W 67 St',
 'Riverside Dr & W 104 St',
 'Riverside Dr & W 72 St',
 'Riverside Dr & W 78 St',
 'Riverside Dr & W 82 St',
 'Riverside Dr & W 91 St',
 'Rivington St & Chrystie St',
 'Rivington St & Ridge St',
 'Roebling St & N 4 St',
 'Rogers Ave & Sterling St',
 'S 3 St & Bedford Ave',
 'S 4 St & Rodney St',
 'S 4 St & Wythe Ave',
 'S 5 Pl & S 5 St',
 'S Portland Ave & Hanson Pl',
 'Schermerhorn St & Bond St',
 'Schermerhorn St & Court St',
 'Scholes St & Manhattan Ave',
 'Sharon St & Olive St',
 'Sigourney St & Columbia St',
 'Sip Ave',
 'Smith St & 3 St',
 'Smith St & 9 St',
 'Soissons Landing',
 'South End Ave & Liberty St',
 'South St & Gouverneur Ln',
 'South St & Whitehall St',
 'Spruce St & Nassau St',
 'St James Pl & Oliver St',
 'St James Pl & Pearl St',
 'St Johns Pl & Washington Ave',
 'St Marks Pl & 1 Ave',
 'St Marks Pl & 2 Ave',
 'St Nicholas Ave & Manhattan Ave',
 'St. Nicholas Ave & W 126 St',
 'Stagg St & Morgan Ave',
 'Stagg St & Union Ave',
 'Stanton St & Chrystie St',
 'Stanton St & Mangin St',
 'Stanton St & Norfolk St',
 'State St & Smith St',
 'Steinway St & 23 Ave',
 'Steinway St & 28 Ave',
 'Sterling Pl & Bedford Ave',
 'Sterling St & Bedford Ave',
 'Stockholm St & Wilson Ave',
 'Stuyvesant Walk & 1 Av Loop',
 'Suffolk St & Stanton St',
 'Sullivan Pl & Bedford Ave',
 'Sullivan St & Washington Sq',
 'Suydam St & Knickerbocker Ave',
 'Throop Ave & Myrtle Ave',
 'Tompkins Ave & Hopkins St',
 'Underhill Ave & Lincoln Pl',
 'Underhill Ave & Pacific St',
 'Union Ave & Jackson St',
 'Union Ave & N 12 St',
 'Union Ave & Wallabout St',
 'Union St',
 'Union St & 4 Ave',
 'Union St & Bedford Ave',
 'Union St & Nevins St',
 'University Pl & E 14 St',
 'University Pl & E 14 St (old)',
 'University Pl & E 8 St',
 'Van Brunt St & Van Dyke St',
 'Van Brunt St & Wolcott St',
 'Van Vorst Park',
 'Vernon Blvd & 10 St',
 'Vernon Blvd & 30 Rd',
 'Vernon Blvd & 31 Ave',
 'Vernon Blvd & 41 Rd',
 'Vernon Blvd & 47 Rd',
 'Vernon Blvd & 50 Ave',
 'Verona Pl & Fulton St',
 'Vesey Pl & River Terrace',
 'W 100 St & Broadway',
 'W 100 St & Manhattan Ave',
 'W 104 St & Amsterdam Ave',
 'W 106 St & Amsterdam Ave',
 'W 106 St & Central Park West',
 'W 107 St & Columbus Ave',
 'W 11 St & 6 Ave',
 'W 110 St & Amsterdam Ave',
 'W 113 St & Broadway',
 'W 116 St & Amsterdam Ave',
 'W 116 St & Broadway',
 'W 120 St & Claremont Ave',
 'W 129 St & Convent Ave',
 'W 13 St & 5 Ave',
 'W 13 St & 7 Ave',
 'W 15 St & 10 Ave',
 'W 15 St & 6 Ave',
 'W 15 St & 7 Ave',
 'W 16 St & 8 Ave',
 'W 16 St & The High Line',
 'W 17 St & 8 Ave',
 'W 18 St & 6 Ave',
 'W 18 St & 9 Ave',
 'W 20 St & 11 Ave',
 'W 20 St & 7 Ave',
 'W 20 St & 8 Ave',
 'W 21 St & 6 Ave',
 'W 22 St & 10 Ave',
 'W 22 St & 8 Ave',
 'W 24 St & 7 Ave',
 'W 25 St & 6 Ave',
 'W 26 St & 10 Ave',
 'W 26 St & 8 Ave',
 'W 27 St & 10 Ave',
 'W 27 St & 7 Ave',
 'W 31 St & 7 Ave',
 'W 33 St & 7 Ave',
 'W 34 St & 11 Ave',
 'W 35 St & Dyer Ave',
 'W 37 St & 10 Ave',
 'W 37 St & 5 Ave',
 'W 38 St & 8 Ave',
 'W 39 St & 9 Ave',
 'W 4 St & 7 Ave S',
 'W 40 St & 5 Ave',
 'W 41 St & 8 Ave',
 'W 42 St & 8 Ave',
 'W 42 St & Dyer Ave',
 'W 43 St & 10 Ave',
 'W 43 St & 6 Ave',
 'W 44 St & 5 Ave',
 'W 45 St & 6 Ave',
 'W 45 St & 8 Ave',
 'W 46 St & 11 Ave',
 'W 47 St & 10 Ave',
 'W 49 St & 8 Ave',
 'W 50 St & 9 Ave',
 'W 52 St & 11 Ave',
 'W 52 St & 5 Ave',
 'W 52 St & 6 Ave',
 'W 53 St & 10 Ave',
 'W 54 St & 9 Ave',
 'W 55 St & 6 Ave',
 'W 56 St & 10 Ave',
 'W 59 St & 10 Ave',
 'W 63 St & Broadway',
 'W 64 St & Thelonious Monk Circle',
 'W 67 St & Broadway',
 'W 70 St & Amsterdam Ave',
 'W 74 St & Columbus Ave',
 'W 76 St & Columbus Ave',
 'W 78 St & Broadway',
 'W 82 St & Central Park West',
 'W 84 St & Broadway',
 'W 84 St & Columbus Ave',
 'W 87 St & Amsterdam Ave',
 'W 87 St & West End Ave',
 'W 88 St & West End Ave',
 'W 89 St & Columbus Ave',
 'W 90 St & Amsterdam Ave',
 'W 92 St & Broadway',
 'W 95 St & Broadway',
 'W Broadway & Spring Street',
 'Warren St & Court St',
 'Washington Ave & Greene Ave',
 'Washington Ave & Park Ave',
 'Washington Park',
 'Washington Pl & 6 Ave',
 'Washington Pl & Broadway',
 'Washington St',
 'Washington St & Gansevoort St',
 'Water - Whitehall Plaza',
 'Waterbury St & Stagg St',
 'Watts St & Greenwich St',
 'West Drive & Prospect Park West',
 'West End Ave & W 107 St',
 'West End Ave & W 94 St',
 'West St & Chambers St',
 'West Thames St',
 'White St & Johnson Ave',
 'White St & Moore St',
 'William St & Pine St',
 'Willoughby Ave & Hall St',
 'Willoughby Ave & Myrtle Ave',
 'Willoughby Ave & Tompkins Ave',
 'Willoughby Ave & Walworth St',
 'Willoughby Ave & Wyckoff Ave',
 'Willoughby St & Fleet St',
 'Withers St & Kingsland Ave',
 'Wolcott St & Dwight St',
 'Wyckoff Av & Jefferson St',
 'Wyckoff St & 3 Ave',
 'Wyckoff St & Bond St',
 'Wythe Ave & Metropolitan Ave',
 'Yankee Ferry Terminal',
 'York St'}

NYCBS Depot locations are test locations and should be removed from the dataset

In [125]:
start_station_test = df_cb.loc[(df_cb.start_station_name.notna()) & (df_cb.start_station_name.str.contains("NYCBS"))]
print(start_station_test.shape)
df_cb = df_cb.loc[~df_cb.index.isin(start_station_test.index)]
(26, 23)
In [126]:
end_station_test = df_cb.loc[(df_cb.end_station_name.notna()) & (df_cb.end_station_name.str.contains("NYCBS"))]
print(end_station_test.shape)
df_cb = df_cb.loc[~df_cb.index.isin(end_station_test.index)]
(368, 23)

Check that the same station_id always matches the same station_name (Extra)

In [57]:
start_st = (
    df_cb[
        [
            "start_station_id",
            "start_station_name",
            "start_station_latitude",
            "start_station_longitude",
        ]
    ]
    .drop_duplicates()
    .dropna()
)
display(
    start_st.loc[start_st.start_station_name.duplicated(keep=False)].sort_values(
        "start_station_name"
    )
)
end_st = (
    df_cb[
        [
            "end_station_id",
            "end_station_name",
            "end_station_latitude",
            "end_station_longitude",
        ]
    ]
    .drop_duplicates()
    .dropna()
)
display(
    end_st.loc[end_st.end_station_name.duplicated(keep=False)].sort_values(
        "end_station_name"
    )
)
start_station_id start_station_name start_station_latitude start_station_longitude
1480471 250.0 Lafayette St & Jersey St 40.724561 -73.995653
2345157 3427.0 Lafayette St & Jersey St 40.724305 -73.996010
72 382.0 University Pl & E 14 St 40.734927 -73.992005
5782586 3812.0 University Pl & E 14 St 40.734814 -73.992085
1917506 3809.0 W 55 St & 6 Ave 40.763201 -73.978458
4525541 3809.0 W 55 St & 6 Ave 40.763189 -73.978434
end_station_id end_station_name end_station_latitude end_station_longitude
1480222 250.0 Lafayette St & Jersey St 40.724561 -73.995653
2345410 3427.0 Lafayette St & Jersey St 40.724305 -73.996010
207 382.0 University Pl & E 14 St 40.734927 -73.992005
5780630 3812.0 University Pl & E 14 St 40.734814 -73.992085
1915288 3809.0 W 55 St & 6 Ave 40.763201 -73.978458
4525723 3809.0 W 55 St & 6 Ave 40.763189 -73.978434

Start Stations

In [75]:
col_of_int = "start_station_name"
col_title = "Start Station Name"
df_start_stations = alt_hist(df_cb, col_of_int)

p = alt.Chart(df_start_stations).mark_bar().encode(
    x=alt.X("col_count", title="Count", axis=alt.Axis(format="s")),
    y=alt.Y(
        col_of_int,
        title = col_title,
        sort=alt.EncodingSortField(
            field="col_count", 
            order="descending",  
        ),
        axis= alt.Axis(labels=False)
    ),
    tooltip=[
        alt.Tooltip("col_count", title="Count", format="~s"),
        alt.Tooltip(col_of_int, title=col_title),
    ],    
)

standard_props(p)
Out[75]:

End Stations

In [76]:
col_of_int = "end_station_name"
col_title = "End Station Name"
df_end_stations = alt_hist(df_cb, col_of_int)

p = alt.Chart(df_end_stations).mark_bar().encode(
    x=alt.X("col_count", title="Count", axis=alt.Axis(format="s")),
    y=alt.Y(
        col_of_int,
        title = col_title,
        sort=alt.EncodingSortField(
            field="col_count", 
            order="descending",  
        ),
        axis= alt.Axis(labels=False)
    ),
    tooltip=[
        alt.Tooltip("col_count", title="Count", format="~s"),
        alt.Tooltip(col_of_int, title=col_title),
    ],    
)

standard_props(p)
Out[76]:

User Type, Birth Year, and Gender

Sanity Checks

Check for null values

In [145]:
df_cb[["usertype", "birth_year", "gender"]].isna().sum()
Out[145]:
usertype      0
birth_year    0
gender        0
dtype: int64

User Type

In [77]:
df_cb.usertype.value_counts(dropna=False)
Out[77]:
Subscriber    5713907
Customer      1256281
Name: usertype, dtype: int64

Gender

In [78]:
df_cb.gender.value_counts(dropna=False)
Out[78]:
1    4583435
2    1759218
0     627535
Name: gender, dtype: int64

I am assuming that 1 is male, 2 is female, and 0 is unknown/other

Birth Year

In [79]:
col_of_int = "birth_year"
col_title = "Birth Year"
df_birth_year = alt_hist(df_cb, col_of_int)

p = (
    alt.Chart(df_birth_year)
    .mark_bar()
    .encode(
        x=alt.X(col_of_int, title=col_title, axis=alt.Axis(format="c")),
        y=alt.Y("col_count", title="Count", axis=alt.Axis(format="s")),
        tooltip=[
            alt.Tooltip("col_count", title="Count", format="~s"),
            alt.Tooltip(col_of_int, title=col_title),
        ],
    )
)
standard_props(p).interactive()
Out[79]:

In [81]:
p = (
    alt.Chart(df_birth_year.loc[df_birth_year.birth_year < 1940])
    .mark_bar()
    .encode(
        x=alt.X(col_of_int, title=col_title, axis=alt.Axis(format="c")),
        y=alt.Y("col_count", title="Count", axis=alt.Axis(format="s")),
        tooltip=[
            alt.Tooltip("col_count", title="Count", format="~s"),
            alt.Tooltip(col_of_int, title=col_title),
        ],
    )
)
standard_props(p).interactive()
Out[81]:

In [82]:
p = (
    alt.Chart(df_birth_year.loc[df_birth_year.birth_year > 1995])
    .mark_bar()
    .encode(
        x=alt.X(col_of_int, title=col_title, axis=alt.Axis(format="c")),
        y=alt.Y("col_count", title="Count", axis=alt.Axis(format="s")),
        tooltip=[
            alt.Tooltip("col_count", title="Count", format="~s"),
            alt.Tooltip(col_of_int, title=col_title),
        ],
    )
)
standard_props(p).interactive()
Out[82]:

I would assume that 1969 is the default birth year value, hence its spike. To prevent them from skewing any analysis, I will replace them with null values. There are definitely some false values in that dataset, as I doubt there are any bikers over 120 years old, but there is no obvious cutoff. However, since they are on the edge of the data they should influence any trends.

In [83]:
df_cb.loc[df_cb.birth_year==1969, "birth_year"] = np.NAN

Relationships to Trip Duration

Correlation

In [187]:
df_cb_corr = df_cb.corr()

In [206]:
df_cb_corr_melt = (
    df_cb_corr.reset_index().rename(columns={"index": "var"}).melt(id_vars=["var"])
)
p = (
    alt.Chart(df_cb_corr_melt)
    .mark_square()
    .encode(
        x=alt.X("var", title=None),
        y=alt.Y("variable", title=None),
        color=alt.Color("value"),
        tooltip=[
            alt.Tooltip("var", title="Variable 1"),
            alt.Tooltip("variable", title="Variable 2"),
            alt.Tooltip("value", title="Correlation", format=".2%"),
        ],
    )
)
standard_props(p)
Out[206]:

Trip Duration vs Trip Distance

Calculate Trip Distance

In [84]:
df_latlong = (
    df_cb[
        [
            "start_station_latitude",
            "start_station_longitude",
            "end_station_longitude",
            "end_station_latitude",
        ]
    ]
    .drop_duplicates()
    .copy()
)
df_latlong.shape
Out[84]:
(286486, 4)
In [85]:
df_latlong["start_lat_long"] = list(
    zip(df_latlong.start_station_latitude, df_latlong.start_station_longitude)
)
df_latlong["end_lat_long"] = list(
    zip(df_latlong.end_station_latitude, df_latlong.end_station_longitude)
)
In [86]:
df_latlong["trip_distance"] = df_latlong.apply(
    lambda x: distance.distance(x.start_lat_long, x.end_lat_long).miles, axis=1
)
In [87]:
df_cb = df_cb.merge(
    df_latlong,
    on=[
        "start_station_latitude",
        "start_station_longitude",
        "end_station_longitude",
        "end_station_latitude",
    ],
    how="left"
)

Plot Relationship

In [88]:
df_dist_time = df_cb.loc[df_cb.tripduration < (60 * 60)][
    ["trip_distance", "tripduration"]
].sample(5 * 10 ** 3, random_state=44)

p = (
    alt.Chart(df_dist_time)
    .mark_circle()
    .encode(
        x=alt.X("trip_distance", title="Trip Distance (Miles)"),
        y=alt.Y(
            "tripduration", title="Trip Duration (Seconds)", axis=alt.Axis(format="s")
        ),
        tooltip=[
            alt.Tooltip("tripduration", title="Duration (Seconds)", format="s"),
            alt.Tooltip("trip_distance", title="Distance (Miles)"),
        ],
    )
)
standard_props(p)
Out[88]:

Trip Duration vs Birth Year

In [92]:
df_time_birth = df_cb.loc[df_cb.tripduration < (60 * 60)][
    ["birth_year", "tripduration"]
].sample(5 * 10 ** 3, random_state=44)

p = (
    alt.Chart(df_time_birth)
    .mark_circle(opacity=0.3)
    .encode(
        x=alt.X("birth_year", title="Birth Year", scale=alt.Scale(zero=False), axis=alt.Axis(format="d")),
        y=alt.Y(
            "tripduration", title="Trip Duration (Seconds)", axis=alt.Axis(format="s")
        ),
        tooltip=[
            alt.Tooltip("tripduration", title="Duration (Seconds)", format="s"),
            alt.Tooltip("birth_year", title="Birth Year"),
        ],
    )
)
standard_props(p)
Out[92]:

Trip Duration vs Time of Day

In [186]:
col_of_int = "start_hour"
col_title = "Hour of Day"
df_dur_start = df_cb.loc[df_cb.tripduration < (60 * 60)][
    ["start_hour", "tripduration"]
].sample(5 * 10 ** 3, random_state=44)

p = (
    alt.Chart(df_dur_start)
    .mark_circle(opacity=0.3)
    .encode(
        x=alt.X(col_of_int, title=col_title, scale=alt.Scale(zero=False), axis=alt.Axis(format="d")),
        y=alt.Y(
            "tripduration", title="Trip Duration (Seconds)", axis=alt.Axis(format="s")
        ),
        tooltip=[
            alt.Tooltip("tripduration", title="Duration (Seconds)", format="s"),
            alt.Tooltip(col_of_int, title=col_title),
        ],
    )
)
standard_props(p)
Out[186]:

Other plot ideas

  • Map station id vs use

Most used stations

In [94]:
df_cb.start_station_id.value_counts().head()
Out[94]:
519.0     49881
497.0     41456
426.0     41150
514.0     39577
3255.0    39034
Name: start_station_id, dtype: int64
In [110]:
df_cb.tail()
Out[110]:
6970183    23:59:00
6970184    23:59:00
6970185    23:59:00
6970186    00:00:00
6970187    00:00:00
Name: starttime, dtype: object
In [123]:
df_min.dtypes
Out[123]:
starttime    object
ct            int64
dtype: object
In [141]:
 
Out[141]:
0    1900-01-01 00:00:00
1    1900-01-01 00:15:00
2    1900-01-01 00:30:00
3    1900-01-01 00:45:00
4    1900-01-01 01:00:00
             ...        
91   1900-01-01 22:45:00
92   1900-01-01 23:00:00
93   1900-01-01 23:15:00
94   1900-01-01 23:30:00
95   1900-01-01 23:45:00
Name: starttime, Length: 96, dtype: datetime64[ns]
In [151]:
df_min.starttime.apply(lambda x: x.strftime('%H:%M'))
Out[151]:
0      00:00
1      00:05
2      00:10
3      00:15
4      00:20
       ...  
280    23:35
281    23:40
282    23:45
283    23:50
284    23:55
Name: starttime, Length: 285, dtype: object
In [152]:
df_min = df_cb.loc[df_cb.start_station_id==519].starttime.dt.round("5min").dt.time.to_frame()
# df_min.starttime = pd.to_datetime(df_min.starttime, format="%H:%M:%S")
df_min["ct"] = 1
df_min = df_min.groupby("starttime", as_index=False).count()
df_min["starttime_date"] = pd.to_datetime(df_min.starttime, format="%H:%M:%S")
df_min.starttime = df_min.starttime.apply(lambda x: x.strftime('%H:%M'))

p = (
    alt.Chart(df_min)
    .mark_bar()
    .encode(
        x=alt.X("starttime_date:T", title="Start Time"),
        y=alt.Y(
            "ct", title="Count", axis=alt.Axis(format="s")
        ),
        tooltip=[
            alt.Tooltip("ct", title="Count", format="s"),
            alt.Tooltip("starttime", title="Start Time"),
        ],
    )
)
standard_props(p).interactive()
Out[152]:

In [146]:
df_min
Out[146]:
starttime ct starttime_date
0 00:00:00 30 1900-01-01 00:00:00
1 00:05:00 14 1900-01-01 00:05:00
2 00:10:00 21 1900-01-01 00:10:00
3 00:15:00 31 1900-01-01 00:15:00
4 00:20:00 26 1900-01-01 00:20:00
... ... ... ...
280 23:35:00 34 1900-01-01 23:35:00
281 23:40:00 59 1900-01-01 23:40:00
282 23:45:00 38 1900-01-01 23:45:00
283 23:50:00 35 1900-01-01 23:50:00
284 23:55:00 26 1900-01-01 23:55:00

285 rows × 3 columns

Building a GLM

In [210]:
import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
h2o.init(min_mem_size="7G")
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "1.8.0_222"; OpenJDK Runtime Environment (build 1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10); OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)
  Starting server from /home/ubuntu/.local/share/virtualenvs/msnow_wip-lPAcKwpP/lib/python3.7/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmp7ik5l4pp
  JVM stdout: /tmp/tmp7ik5l4pp/h2o_ubuntu_started_from_python.out
  JVM stderr: /tmp/tmp7ik5l4pp/h2o_ubuntu_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O cluster uptime: 02 secs
H2O cluster timezone: America/New_York
H2O data parsing timezone: UTC
H2O cluster version: 3.26.0.5
H2O cluster version age: 1 month and 14 days
H2O cluster name: H2O_from_python_ubuntu_qkdl0e
H2O cluster total nodes: 1
H2O cluster free memory: 6.983 Gb
H2O cluster total cores: 8
H2O cluster allowed cores: 8
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy: None
H2O internal security: False
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python version: 3.7.4 final
In [216]:
df_cb.columns
Out[216]:
Index(['tripduration', 'starttime', 'stoptime', 'start_station_id',
       'start_station_name', 'start_station_latitude',
       'start_station_longitude', 'end_station_id', 'end_station_name',
       'end_station_latitude', 'end_station_longitude', 'bikeid', 'usertype',
       'birth_year', 'gender', 'calculatedduration', 'duration_diff',
       'tripduration_min', 'start_month', 'start_day', 'start_date',
       'start_dow', 'start_hour', 'start_lat_long', 'end_lat_long',
       'trip_distance'],
      dtype='object')
In [258]:
outcome_y = "tripduration_min"
train_sz = 5 * 10 ** 5
test_sz =3 * 10 ** 5
df_cb_model = df_cb[
    [
        "tripduration_min",
        "starttime",
        "start_station_id",
        "start_station_name",
        "start_station_latitude",
        "start_station_longitude",
        "bikeid",
        "usertype",
        "birth_year",
        "gender",
        "start_month",
        "start_day",
        "start_date",
        "start_dow",
        "start_hour",
    ]
].copy()
df_cb_model = df_cb_model.loc[df_cb_model.tripduration_min<60*4]
df_train, df_test = train_test_split(df_cb_model, train_smp=train_sz, test_smp=test_sz)
df_train.shape, df_test.shape
Out[258]:
((500000, 15), (300000, 15))
In [259]:
train_h2o = h2o.H2OFrame(df_train)
train_h2o[outcome_y] = train_h2o[outcome_y]
test_h2o = h2o.H2OFrame(df_test)
test_h2o[outcome_y] = test_h2o[outcome_y]
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
In [228]:
linear_param = {
    "max_runtime_secs" : 25
    ,"nfolds":3
    ,"lambda_":0
    ,"remove_collinear_columns":True
    ,"alpha":0
#     ,"family":'binomial'
    ,"standardize": True
}
In [260]:
x_cols = [x for x in df_cb_model.columns if x not in  [outcome_y]]
h2o_glm = H2OGeneralizedLinearEstimator()
h2o_glm.train(x=x_cols, y=outcome_y, training_frame=train_h2o)
glm Model Build progress: |███████████████████████████████████████████████| 100%
In [261]:
h2o_glm.model_performance(train_h2o)
ModelMetricsRegressionGLM: glm
** Reported on test data. **

MSE: 158.74195195516288
RMSE: 12.599283787388982
MAE: 8.076047099594737
RMSLE: 0.7208705305213903
R^2: 0.11791842179799916
Mean Residual Deviance: 158.74195195516288
Null degrees of freedom: 499999
Residual degrees of freedom: 499707
Null deviance: 89981446.09184757
Residual deviance: 79370975.97758144
AIC: 3953166.503275806
Out[261]:

In [262]:
h2o_glm.model_performance(test_h2o)
ModelMetricsRegressionGLM: glm
** Reported on test data. **

MSE: 160.1807017088014
RMSE: 12.656251487261203
MAE: 8.085422260509482
RMSLE: 0.7213360372238161
R^2: 0.11702819348970417
Mean Residual Deviance: 160.1807017088014
Null degrees of freedom: 299999
Residual degrees of freedom: 299707
Null deviance: 54423424.43751838
Residual deviance: 48054210.51264042
AIC: 2374841.889014085
Out[262]:

In [263]:
df_test["predicted_duration"] = h2o_glm.predict(test_h2o).as_data_frame()
glm prediction progress: |████████████████████████████████████████████████| 100%
In [252]:
df_test.columns
Out[252]:
Index(['tripduration_min', 'starttime', 'start_station_id',
       'start_station_name', 'start_station_latitude',
       'start_station_longitude', 'bikeid', 'usertype', 'birth_year', 'gender',
       'start_month', 'start_day', 'start_date', 'start_dow', 'start_hour',
       'predicted_duration'],
      dtype='object')
In [270]:
df_tmp = df_test.sample(5 * 10 ** 3, random_state=44)

p = (
    alt.Chart(df_tmp)
    .mark_circle(clip=True)
    .encode(
        x=alt.X(
            "tripduration_min",
            title="Trip Duration (Minutes)",
            scale=alt.Scale(zero=False, domain=(0,40)),
            axis=alt.Axis(format="d"),
        ),
        y=alt.Y(
            "predicted_duration",
            title="Predicted Duration (Minutes)",
            axis=alt.Axis(format="d"),
            scale=alt.Scale(domain=(0,40))
        ),
        tooltip=[
            alt.Tooltip("tripduration_min", title="Duration", format="s"),
            alt.Tooltip("predicted_duration", title="Predicted Duration"),
        ],
    )
)
standard_props(p)
Out[270]:

In [272]:
df_test.reset_index(drop=True).to_feather("../downloads/model_predictions.feather")